ABSTRACT

Prosodic patterns provide important cues for resolving syntactic
ambiguity, and might be used to improve the accuracy of automatic
speech understanding. With this goal, we propose a method of
scoring syntactic parses in terms of observed prosodic cues, which
can be used in ranking sentence hypotheses and associated parses.
Specifically, the score is the probability of acoustic features of
a hypothesized word sequence given an associated syntactic parse,
based on acoustic and “language” (prosody/syntax) models that
represent probabilities in terms of abstract prosodic labels. This
work reports initial efforts aimed at extending the algorithm to
spontaneous speech, specifically the ATIS task, where the
prosody/parse score is shown to improve the average rank of the
correct sentence hypothesis.

1.  INTRODUCTION 

Human  listeners  bring  several  sources  of  information 
to  bear  in  interpreting  an  utterance,  including  syn¬ 
tax,  semantics,  discourse,  pragmatics  and  prosodic  cues. 
Prosody,  in  particular,  provides  information  about  syn¬ 
tactic  structure  (via  prosodic  constituent  structure)  and 
information  focus  (via  phrasal  prominence),  and  is  en¬ 
coded  in  the  acoustic  signal  in  terms  of  timing,  energy 
and  intonation  patterns.  Since  computer  knowledge  rep¬ 
resentations  are  not  as  sophisticated  as  human  knowl¬ 
edge,  utterances  that  are  straightforward  for  a  human  to 
interpret  may  be  “ambiguous”  to  an  automatic  speech 
understanding  system.  For  this  reason,  it  is  useful  to 
include  as  many  knowledge  sources  as  possible  in  auto¬ 
matic  speech  understanding,  and  prosody  is  currently  an 
untapped  resource.  In  fact,  some  syntactic  ambiguities 
can be resolved by listeners from prosody alone [1].

One  way  to  incorporate  prosody  in  speech  understand¬ 
ing  is  to  score  the  expected  prosodic  structure  for  each 
candidate  sentence  hypothesis  and  syntactic  parse  in  re¬ 
lation  to  the  observed  prosodic  structure.  In  a  speech 
understanding  system  where  multiple  sentence  hypothe¬ 
ses  are  passed  from  recognition  to  natural  language  pro¬ 
cessing,  the  prosody/parse  score  could  be  used  to  rank 
hypotheses  and  associated  parses,  directly  or  in  combina¬ 
tion  with  other  scores.  The  parse  scoring  approach  was 
proposed in previous work [2], where automatically detected
prosodic phrase breaks were scored either in terms
of  their  correlation  with  prosodic  structure  predicted 
from  parse  information  or  in  terms  of  their  likelihood 
according  to  a  probabilistic  prosody /syntax  model.  Re¬ 
cently,  the  parse  scoring  approach  was  reformulated  [3] 
to  avoid  explicit  recognition  of  prosodic  patterns,  which 
is  a  sub-optimal  intermediate  decision.  Specifically,  the 
new  score  is  the  probability  of  a  hypothesized  word  se¬ 
quence  and  associated  syntactic  parse  given  acoustic  fea¬ 
tures,  where  both  an  acoustic  model  and  a  “language” 
(prosody/syntax)  model  are  used  to  represent  the  proba¬ 
bility  of  utterance,  analogous  to  speech  recognition  tech¬ 
niques.  The  parse  scoring  formalism  was  also  extended 
to  incorporate  phrasal  prominence  information,  in  ad¬ 
dition  to  phrase  breaks.  In  previous  work,  we  demon¬ 
strated  the  feasibility  of  using  parse  scoring  to  find  the 
correct  interpretation  in  a  corpus  of  professionally  read 
ambiguous  sentences.  In  this  work,  we  use  the  parse  scor¬ 
ing  approach  to  rerank  a  speech  understanding  system’s 
N-best  output,  specifically  in  the  ATIS  task  domain,  in 
order  to  improve  sentence  understanding  accuracy. 

In  the  following  section,  we  describe  the  parse 
scoring  system  and  the  probabilistic  acoustic  and 
prosody/syntax  models.  Next,  we  discuss  issues  that 
arose  in  extending  the  parse  scoring  algorithm  to 
the  ATIS  task,  including  several  modifications  needed 
to  handle  new  problems  associated  with  spontaneous 
speech  and  the  new  parser  and  recognizer.  We  then 
present  experimental  results  for  the  task  of  reranking 
the  top  N  recognizer  hypotheses  and  associated  parses 
using  prosody/parse  scores.  Finally,  we  discuss  the  im¬ 
plications  of  the  results  for  future  work. 

2.  PARSE  SCORING 
2.1.  General  Formalism 

The goal of this work is to reorder the set of N-best recognizer
hypotheses by ranking each hypothesis and associated parse in
terms of a prosody score. More specifically, the prosody/parse
score is the probability of a sequence of acoustic observations
$x = \{x_1, \ldots, x_n\}$ given the hypothesized parse,
p(x|parse), where x is a sequence of duration and F0 measurements
associated with the recognizer output. We compute this probability
using an intermediate phonological representation of a sequence of
abstract prosodic labels $a = \{a_1, \ldots, a_n\}$:

$$p(x \mid \mathrm{parse}) = \sum_{a} p(x \mid a)\, p(a \mid \mathrm{parse}) \qquad (1)$$

This  representation  implies  the  development  of  two  prob¬ 
abilistic  models:  an  acoustic  model  of  prosodic  patterns, 
p(x|a),  and  a  model  of  the  relationship  between  prosody 
and  syntax  p(a|parse),  analogous  to  a  language  model  in 
speech  recognition. 

The general formalism can accommodate many types of abstract
labels in the prosodic pattern sequence a. Here, the prosodic
labeling scheme is an extension of that proposed in [1] and
includes integer break indices, one for each word to indicate
prosodic constituent structure, and a binary indicator of presence
vs. absence of prominence on every syllable. Thus, the prosodic
label sequence is given by a = (b,p), where b represents the break
sequence and p represents the prominence sequence. To simplify the
current implementation, we assume b and p are independent. This
assumption implies the use of two acoustic models, p(x|b) and
p(x|p), and two prosody/syntax models, p(b|parse) and p(p|parse).
(Relaxation of the independence assumption is discussed in
Section 5.)

Both the acoustic and prosody/syntax models make use of
(different) binary decision trees. A binary decision tree [4] is
an ordered sequence of binary questions that successively split
the data, ultimately into sets associated with the tree’s terminal
nodes or leaves. Decision trees are particularly useful for
prosody applications because they can easily model feature sets
with both categorical and continuous variables without requiring
independence assumptions. During training, the sequence of
questions is selected from a specified set to minimize some
impurity criterion on the sample distribution of classes in the
training data. For typical classification problems, a leaf would
then be associated with a class label. In this work, however,
leaves are associated with the posterior distribution of the
classes given the leaf node, and the tree can be thought of as
“quantizing” the feature vectors. Here, the classes are either the
different levels of breaks, one after each word, or the binary
prominence labels, one for each syllable.
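As an illustration of a tree whose leaves hold posterior distributions rather than single class labels, the following Python sketch may help. It is a minimal example under stated assumptions: the two questions, the feature layout (normalized duration, pause duration) and the leaf probabilities are all hypothetical and are not taken from the trained trees used in this work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Union


@dataclass
class Leaf:
    # Posterior distribution over class labels for vectors reaching this leaf.
    posterior: Dict[str, float]


@dataclass
class Node:
    # A binary question about the feature vector: True -> `yes`, False -> `no`.
    question: Callable[[Sequence[float]], bool]
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def quantize(tree: Union[Node, Leaf], features: Sequence[float]) -> Leaf:
    """Follow the binary questions until a terminal node (leaf) is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(features) else node.no
    return node


# Hypothetical tree over (normalized duration, pause duration in seconds).
toy_tree = Node(
    question=lambda f: f[1] > 0.1,            # is there a noticeable pause?
    yes=Leaf({"minor": 0.2, "major": 0.8}),
    no=Node(
        question=lambda f: f[0] > 1.0,        # is the segment lengthened?
        yes=Leaf({"minor": 0.6, "major": 0.4}),
        no=Leaf({"minor": 0.9, "major": 0.1}),
    ),
)

leaf = quantize(toy_tree, (1.5, 0.0))
print(leaf.posterior)
```

Every feature vector that reaches the same leaf receives the same distribution, which is the sense in which the tree quantizes the feature space.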

2.2.  Acoustic  Model 

The acoustic models, one for breaks and one for prominences, are
based on decision trees originally developed for automatic
prosodic labeling [5, 6]. The form of the two models is
essentially the same. The break model, for example, represents the
probability distribution of the different breaks at a word
boundary, $p(b \mid T_{Ab}(x))$, where $T_{Ab}(x)$ is the terminal
node of the acoustic break tree corresponding to observation x.
Assuming the observations are conditionally independent given the
breaks, the probability of the observation sequence is given by

$$p(x \mid b) = \prod_{i=1}^{n} p(x_i \mid b_i) = \prod_{i=1}^{n} \frac{p(b_i \mid T_{Ab}(x_i))\, p(x_i)}{p(b_i)}$$

using the decision tree acoustic model. The probability p(x|p) is
computed using a similar formula with a separate acoustic tree
$T_{Ap}(x)$ trained to model prominence.
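The conversion from leaf posteriors to observation likelihoods can be sketched in a few lines of Python. This is only an illustration: the function name, the label set and the numbers are hypothetical, and the observation term p(x_i), which the tree does not supply, is left out, so the result is the log likelihood up to that additive term.

```python
import math
from typing import Dict, List


def scaled_log_likelihood(
    leaf_posteriors: List[Dict[str, float]],
    labels: List[str],
    prior: Dict[str, float],
) -> float:
    """Return sum_i log[p(b_i | leaf_i) / p(b_i)].

    This equals log p(x|b) minus sum_i log p(x_i), assuming the
    observations are conditionally independent given the labels.
    """
    total = 0.0
    for posterior, label in zip(leaf_posteriors, labels):
        total += math.log(posterior[label]) - math.log(prior[label])
    return total


# Hypothetical two-word example with two break labels, "0" and "1".
prior = {"0": 0.75, "1": 0.25}
posteriors = [{"0": 0.75, "1": 0.25}, {"0": 0.5, "1": 0.5}]
score = scaled_log_likelihood(posteriors, ["0", "1"], prior)
print(score)
```

In the example the first leaf matches the prior and contributes nothing, while the second leaf doubles the probability of label "1" relative to its prior, so the score is log 2.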

The key differences between the two acoustic models are in the
labels represented and the acoustic features used. The break model
represents several different levels of breaks, while the
prominence model represents ± prominence. Breaks are associated
with words and prominence markers are associated with syllables,
so the observation sequences for the two models are at the word
level and syllable level, respectively. Both models rely on
features computed from speech annotated with phone and word
boundary markers found during speech recognition. Phonetic
segmentations facilitate the use of timing cues, which in this
work are based on segment duration normalized according to
phone-dependent means and variances adapted for estimated speaking
rate. The observation vectors used in the break model $T_{Ab}$ [5]
include features associated with normalized phone duration and
pause duration. The observation vectors used to model prominence
$T_{Ap}$ [6] include similar features, as well as F0 and energy
measurements.

2.3.  Prosody/Syntax  Model

The break and prominence prosody/syntax models are also based on
decision trees, in this case originally designed for synthesis
applications. Hirschberg and colleagues have proposed the use of
decision trees to predict presence vs. absence of prosodic breaks
[7] and of pitch accents [8], with very good results. Our use of
trees for prosody/syntax models differs from this work in the
number of prosodic labels represented, in the use of trees to
provide probability distributions rather than classification
labels, and in the use of trees for parse scoring rather than
prediction. Again, the break and prominence models share the same
basic form. The leaves of the prosody/syntax break tree $T_{Sb}$,
for example, are associated with a probability distribution of the
breaks given the syntactic feature vector $z_i$,
$p(b \mid T_{Sb}(z_i))$. These probabilities are used directly in
computing p(b|parse), assuming the breaks are conditionally
independent given the quantized features $T_{Sb}(z_i)$:

$$p(b \mid \mathrm{parse}) = \prod_{i=1}^{n} p(b_i \mid T_{Sb}(z_i))$$

Again, the probability p(p|parse) can be computed using the same
approach but with a separate prosody/syntax prominence tree
$T_{Sp}$.

For all prosody/syntax models, the feature vectors used in the
tree are based on part-of-speech tags and syntactic bracketing
associated with the hypothesized word sequence. For the break
model $T_{Sb}$, the feature vectors (one for each word) include
content/function word labels, syntactic constituent labels at
different levels of bracketing, measures of distance in branches
from the top and the bottom of the syntactic tree, and location in
the sentence in terms of numbers of words. For the prominence
model $T_{Sp}$ [9], the feature vectors (one for each syllable)
include part-of-speech labels, lexical stress assignment and
syllable position within the word.


2.4.  Joint  Probability  Score 

Using the acoustic and prosody/syntax models and the independence
assumptions described above, the probability of the acoustic
observations $x = (x^{(b)}, x^{(p)})$ given a hypothesized parse
is:

$$p(x \mid \mathrm{parse}) = p(x^{(b)} \mid \mathrm{parse})\, p(x^{(p)} \mid \mathrm{parse})$$

where the break models contribute to the term

$$p(x^{(b)} \mid \mathrm{parse}) = \prod_{i} p(x_i) \sum_{b} \frac{p(b \mid T_{Ab}(x_i))\, p(b \mid T_{Sb}(z_i))}{p(b)}$$

and the prominence models contribute a similar term. If the
problem is to rank different hypothesized parses for the same word
sequence, i.e., the same observation sequence x, then the term
$\prod_i p(x_i)$ can be neglected. However, if different
observation sequences are being compared, as is the case for
different recognition hypotheses, then an explicit model of the
observations is needed. Since the acoustic model readily available
to this effort does not provide the $p(x_i)$ information, we
simply normalize for differences in the length of the word
sequence ($n_w$) and of the syllable sequence ($n_s$):

$$S_J = \ldots \qquad (2)$$
The score given by Equation 2 differs from the probabilistic score
reported in previous work [2] primarily in that it uses the
probability of breaks at each word boundary rather than a single
detected break, but also in that it incorporates information about
phrasal prominence.
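To make the break term concrete, the following Python sketch combines acoustic-tree and syntax-tree leaf posteriors at each word boundary and sums over the break labels. It is a sketch under stated assumptions: the function names and the toy distributions are hypothetical, the p(x_i) factor is dropped, and averaging the log terms over the sequence length is one plausible normalization chosen for illustration, not necessarily the exact form of Equation 2.

```python
import math
from typing import Dict, List

Dist = Dict[str, float]


def position_term(acoustic_post: Dist, syntax_post: Dist, prior: Dist) -> float:
    """sum_b p(b | T_A(x_i)) * p(b | T_S(z_i)) / p(b) for one boundary."""
    return sum(acoustic_post[b] * syntax_post[b] / prior[b] for b in prior)


def normalized_log_score(
    acoustic_posts: List[Dist], syntax_posts: List[Dist], prior: Dist
) -> float:
    """Average log term per position (assumed length normalization)."""
    terms = [
        position_term(a, s, prior) for a, s in zip(acoustic_posts, syntax_posts)
    ]
    return sum(math.log(t) for t in terms) / len(terms)


# Hypothetical two-label example.
prior = {"0": 0.5, "1": 0.5}
acoustic = [{"0": 1.0, "1": 0.0}, {"0": 0.5, "1": 0.5}]
syntax = [{"0": 0.75, "1": 0.25}, {"0": 0.25, "1": 0.75}]
print(normalized_log_score(acoustic, syntax, prior))
```

When the acoustic posterior equals the prior, the term is 1 and contributes nothing to the log score. Agreement between the acoustic evidence and the parse-based prediction raises the term above 1, and disagreement lowers it.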

3.  APPLICATION  TO  ATIS 

The  speech  corpus  is  spontaneous  speech  from  the  ATIS 
(Air  Travel  Information  Service)  domain,  collected  by 
several  different  sites  whose  efforts  were  coordinated  by 
the  MADCOW  group  [10].  The  ATIS  corpus  includes 
speech  from  human  subjects  who  were  given  a  set  of 
air  travel  planning  “scenarios”  to  solve  via  spoken  lan¬ 
guage  communication  with  a  computer.  Queries  made 
by the subjects are classified differently according to
whether they are evaluable in isolation (class A), require
contextual information (class D) or have no canonical
database answer (class X), but these distinctions are ignored
in our work. In the ATIS task domain, speech
understanding  performance  is  measured  in  terms  of  re¬ 
sponse  accuracy  with  a  penalty  for  incorrect  responses, 
as  described  in  [11].  Our  experiments  will  not  assess 
understanding  accuracy,  which  is  a  function  of  the  com¬ 
plete  speech  understanding  system,  but  rather  the  rank 
of  the  correct  answer  after  prosody /parse  scoring. 

A  subset  of  the  ATIS  corpus  was  hand-labeled  with 
prosodic  breaks  and  prominences  for  training  the  acous¬ 
tic  and  prosody /syntax  models.  Since  the  spoken  lan¬ 
guage  systems  at  the  various  data  collection  sites  differ 
in  their  degree  of  automation,  mode  of  communication, 
and  display,  the  training  subset  was  selected  to  represent 
a  balanced  sample  from  each  of  four  sites  (BBN,  CMU, 
MIT  and  SRI)  and  from  males  and  females.  The  Octo¬ 
ber  1991  test  set  is  used  in  the  experiments  reported  in 
Section  4. 

The  prosody/parse  scoring  mechanism  was  evaluated  in 
the  context  of  the  MIT  ATIS  system  [12],  which  com¬ 
municates  the  top  N  recognition  hypotheses  to  the  nat¬ 
ural  language  component  for  further  processing.  The 
speech  recognition  component,  the  SUMMIT  system, 
was  used  to  provide  phone  alignments  for  the  acoustic 
model.  The  SUMMIT  system  uses  segment-based  acous¬ 
tic  phone  models,  a  bigram  stochastic  language  model 
and  a  probabilistic  left-right  parser  to  provide  further 
linguistic  constraints  [12].  TINA,  MIT’s  natural  lan¬ 
guage  component  [13],  interleaves  syntactic  and  task- 
specific  semantic  constraints  to  parse  an  utterance.  As 
a  result,  the  parse  structure  captures  both  syntactic  and 
semantic  constituents.  For  example,  parse  tree  nodes 
may  be  labeled  as  CITY-NAME  or  FLIGHT-EVENT 
rather  than  with  general  syntactic  labels.  In  addition, 
TINA falls back on a robust parsing mechanism when a complete
parse is not found, using a combination of the basic parser and
discourse processing mechanism applied within the utterance [14].
The robust parser enables TINA to handle many more queries, which
may be difficult to parse because they contain complex and/or
incomplete syntactic structures, disfluencies, or simply
recognition errors. The robust parser assigns constituent
structure to as much of the utterance as possible and leaves the
unassigned terminals in the word string, and therefore generates
bracketings with a flatter syntactic structure than that for a
complete parse.

In  order  to  port  our  models  and  scoring  algorithm  to  the 
ATIS  task,  the  first  change  needed  was  a  revision  to  the 
prosodic  labeling  system  to  handle  spontaneous  speech 
phenomena.  The  changes  included  the  addition  of  two 
markers  introduced  in  the  TOBI  prosodic  labeling  sys¬ 
tem  [15].  First,  the  diacritic  “p”  was  added  to  break 
indices  where  needed  to  indicate  that  an  exceptionally 
long  pause  or  lengthening  occurs  due  to  hesitation  [15]. 
As  in  our  previous  work,  we  used  a  seven  level  break 
index  system  to  represent  levels  in  a  constituent  hierar¬ 
chy,  a  superset  of  the  TOBI  breaks.  (The  binary  accent 
labels  represent  a  simplification  or  core  subset  of  the 
TOBI  system.)  The  “p”  diacritic  is  used  fairly  often:  on 
5%  of  the  total  breaks,  on  14%  of  the  breaks  at  levels 
2  and  3,  and  somewhat  more  often  in  utterances  that 
required  a  robust  parse.  In  addition,  a  new  intonational 
marker,  %r,  was  added  to  indicate  the  beginning  of  an 
intonational  phrase  when  the  previous  phrase  did  not 
have  a  well-formed  terminus,  e.g.  in  the  case  of  repairs 
and  restarts.  The  %r  marker  was  rarely  used  and  there¬ 
fore  not  incorporated  in  the  models.  Two  other  prosodic 
“break”  labels  were  added  to  handle  problems  that  arose 
in  the  ATIS  corpus:  “L”  for  linking  was  added  for  mark¬ 
ing  the  boundaries  within  a  lexical  item  (e.g.  San  L 
Francisco)  and  “X”  for  cases  where  the  labelers  did  not 
want  to  mark  a  word  boundary  between  items  (e.g.  af¬ 
ter  an  interrupted  word  such  as  fli-).  The  different  break 
markers  were  grouped  in  the  following  classes  for  robust 
probability  estimates  in  acoustic  modeling:  (0,1, L),  2,  3, 
4-5,  6,  (2p,3p),  and  (4p,5p).  In  these  experiments,  the 
relatively  few  sentences  with  an  “X”  break  were  simply 
left  out  of  the  training  set. 
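The grouping of break markers described above amounts to a lookup table. A minimal Python sketch follows; the class names and the function are hypothetical, and only the grouping itself and the exclusion of utterances containing an “X” break come from the text.

```python
from typing import List, Optional

# Break label -> acoustic-model class (class names are illustrative only).
BREAK_CLASSES = {
    "0": "0-1-L", "1": "0-1-L", "L": "0-1-L",
    "2": "2",
    "3": "3",
    "4": "4-5", "5": "4-5",
    "6": "6",
    "2p": "2p-3p", "3p": "2p-3p",
    "4p": "4p-5p", "5p": "4p-5p",
}


def group_breaks(labels: List[str]) -> Optional[List[str]]:
    """Map hand labels to classes; return None if the utterance has an "X"."""
    if "X" in labels:
        return None  # such utterances are left out of the training set
    return [BREAK_CLASSES[label] for label in labels]


print(group_breaks(["1", "L", "3p", "4", "6"]))
print(group_breaks(["1", "X", "4"]))
```

The table yields seven classes in total, matching the seven groups listed in the text.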

Another  new  problem  introduced  by  the  ATIS  task  was 
the  definition  of  a  “word”,  an  important  issue  because 
prosodic  break  indices  are  labeled  at  each  word  bound¬ 
ary.  The  human  labelers,  the  SUMMIT  recognition  sys¬ 
tem  and  the  TINA  natural  language  processing  system 
all  used  different  lexicons,  differing  on  the  definition  of 
a  “compound  word”  (e.g.  air-fare,  what-is-the).  These 
differences were handled in training by: defining word
boundaries according to the smallest unit marked in any
of the three systems, using the MIT lexicons to associate
the parse and recognition word boundaries, and assigning
any hand-labeled “L” breaks to “1” where the recognizer
or parser indicated a word boundary. In testing,
only the mapping between the recognition and natural
language components is needed, and again the smallest
word units are chosen.

The  main  changes  to  the  acoustic  model  in  moving  to 
the  ATIS  task  were  associated  with  the  particular  phone 
inventory  used  by  the  SUMMIT  system.  The  differences 
in  the  phone  inventory  resulted  in  some  minor  changes 
to  the  syllabification  algorithm  (syllable  boundaries  are 
needed  for  acoustic  feature  extraction).  In  addition,  the 
phone  label  set  was  grouped  into  classes  for  estimating 
robust  duration  means  and  variances.  We  also  revised 
the  pause  duration  feature  to  measure  the  total  duration 
of  all  interword  symbols. 

The  changes  to  the  prosody/syntax  model  simply  in¬ 
volved  defining  new  questions  for  the  decision  tree  de¬ 
sign.  The  first  change  involved  introducing  new  cate¬ 
gories  of  parse  tree  bracketing  labels,  in  part  to  handle 
the  different  naming  conventions  used  in  TINA  and  in 
part  to  take  advantage  of  the  semantic  information  pro¬ 
vided  by  TINA.  In  addition,  new  types  of  questions  were 
added  to  handle  cases  that  included  non-branching  non¬ 
terminals,  specifically,  questions  about  the  full  level  of 
bracketing  and  the  bracketing  defined  only  by  binary 
branching  non-terminals  (i.e.,  using  two  definitions  of 
the  “bottom”  of  the  syntactic  tree)  and  questions  about 
the  non-terminal  labels  at  multiple  levels.  Because  of  the 
differences  in  syntactic  structure  for  word  strings  associ¬ 
ated  with  a  robust  parse  as  opposed  to  a  complete  parse, 
we  chose  to  model  the  prosody  of  breaks  given  a  robust 
parse  separately,  which  is  equivalent  to  forcing  the  first 
branch  of  the  tree  to  test  for  the  use  of  the  robust  parser. 

In  summary,  many  changes  were  necessary  in  porting  the 
algorithm  to  ATIS,  some  of  which  were  required  by  the 
task  of  understanding  spontaneous  speech  while  others 
were  specific  to  the  particular  recognizer  and  parser  used 
here. 

4.  EXPERIMENTS 

In  the  experimental  evaluation  of  the 
prosody/parse  scoring  algorithm  on  ATIS,  the  acoustic 
and  prosody/syntax  models  were  trained  on  the  subset 
of  ATIS  utterances  that  were  hand-labeled  with  prosodic 
markers.  The  acoustic  model  was  trained  from  phonetic 
alignments  provided  by  the  MIT  recognizer,  where  the 
recognizer  output  was  constrained  to  match  the  tran¬ 
scribed  word  sequence.  The  prosody/syntax  model  was 
trained  from  TINA  parses  of  the  transcribed  word  se¬ 
quence. 

For the parse scoring experiments, MIT provided the N best
recognition hypotheses and one parse per hypothesis for each
utterance in the October 1991 test set. The sentence accuracy rate
of the top recognition hypothesis, before any prosodic or natural
language processing, was 32%. We rescored the top 10 hypotheses,
choosing the same number used by the current version of the MIT
ATIS system. 185 of 383 utterances (48%) included the correct word
string in the top 10. Excluding a few other sentences because of
processing difficulties, a total of 179 utterances were used in
evaluating improvements in rank due to prosody. For each sentence
hypothesis, we extracted a sequence of acoustic features from the
phone alignments and F0 contours and a sequence of syntactic
features from the associated parse. Thus, every utterance yielded
ten sequences of acoustic observation vectors and ten associated
sequences of parse features, one pair for each of the ten-best
hypothesized word sequences. Each observation sequence was then
scored according to the syntactic structure of the corresponding
parse, yielding $p(x_i \mid \mathrm{parse}_i)$, i = 1, ..., 10,
for each utterance.

The prosody/parse score was used as one component in a linear
combination of scores, also including the MIT SUMMIT acoustic
score and language model score, which was used to rerank the
sentence hypotheses. We investigated the use of a combined prosody
score and separate break and prominence scores, and separating the
scores gave slightly better performance. The weights in the linear
combination were estimated on the October 1991 data, using the
method reported in [16]. (Although this is not a fair test in the
sense that we are training the three weights on the test set, our
experiments in recognition indicate that performance improvements
obtained typically translate to improvements on independent test
sets.) The acoustic scores were normalized by utterance length in
frames, and the other scores by utterance length in words. We
compared the rankings of the correct word string for the score
combination using only the MIT acoustic and language scores with
the rankings according to the score combination that also used the
prosody/parse probability. The average rank of the correct
utterance, for those in the top 10 to begin with, moved from 1.87
without the prosody score to 1.67 with the prosody score, a gain
of about 23% given that the best rank is 1.0. A paired difference
test indicates that the difference in performance is significant
(t = 2.47, α/2 < .005). In addition, we noticed that incorporation
of the prosody score rarely dropped the rank of the correct
sentence by more than one, whereas it often improved the rank by
more than one.
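The reranking step and the arithmetic behind the reported gain can be sketched as follows. This is an illustration only: the score names, weights and toy hypotheses are hypothetical, and the weight estimation method of [16] is not reproduced. The relative gain is computed against the best achievable rank of 1.0, as in the text.

```python
from typing import Dict, List


def rerank(hypotheses: List[Dict], weights: Dict[str, float]) -> List[Dict]:
    """Sort hypotheses by a weighted sum of their (log) scores, best first."""
    def combined(hyp: Dict) -> float:
        return sum(weights[name] * hyp[name] for name in weights)
    return sorted(hypotheses, key=combined, reverse=True)


def relative_rank_gain(before: float, after: float, best: float = 1.0) -> float:
    """Fraction of the gap to the best possible rank that was closed."""
    return (before - after) / (before - best)


# Hypothetical length-normalized log scores for two competing hypotheses.
nbest = [
    {"text": "hyp a", "acoustic": -1.0, "lm": -2.0, "prosody": -3.0},
    {"text": "hyp b", "acoustic": -1.2, "lm": -2.0, "prosody": -1.0},
]

baseline = rerank(nbest, {"acoustic": 1.0, "lm": 1.0})
with_prosody = rerank(nbest, {"acoustic": 1.0, "lm": 1.0, "prosody": 0.5})
print(baseline[0]["text"], with_prosody[0]["text"])

# Average rank 1.87 -> 1.67: (1.87 - 1.67) / (1.87 - 1.0) is about 0.23.
print(round(relative_rank_gain(1.87, 1.67), 2))
```

In the toy example the prosody score is enough to move the second hypothesis ahead of the first, which is the kind of reordering the experiment measures through the rank of the correct word string.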


5.  DISCUSSION 

In  summary,  we  have  described  a  prosody/parse  scor¬ 
ing  criterion  based  on  the  probability  of  acoustic  obser¬ 
vations  given  a  candidate  parse.  The  model  is  general 
enough  to  handle  a  variety  of  prosodic  labels,  though  we 
have  focused  here  on  prosodic  breaks  and  prominences. 
Motivated  by  the  good  results  in  previous  experiments 
with  this  algorithm  on  professionally  read  speech,  the 
goal  of  this  work  was  to  extend  the  model  to  spontaneous 
speech  and  evaluate  its  usefulness  in  the  context  of  an 
actual  speech  understanding  system,  i.e.  the  MIT  ATIS 
system.  Experimental  results  indicate  that  prosody  can 
be  used  to  improve  the  ranking  of  the  correct  sentence 
among  the  top  N.  We  expect  the  improved  ranking  will 
translate  to  improved  understanding  accuracy,  though 
clearly  this  needs  to  be  confirmed  in  experiments  with  a 
spoken  language  system. 

There  are  several  alternatives  for  improving  both  the 
acoustic  and  prosody /syntax  models.  In  particular,  the 
current  score  uses  a  heuristic  to  account  for  differences 
in  observation  sequences,  which  could  be  better  handled 
by  explicitly  representing  p(x|a)  rather  than  the  pos¬ 
terior  probability  p(a|x)  in  the  acoustic  model.  Other 
possible  extensions  include  relaxation  of  independence 
assumptions,  in  particular  the  independence  of  breaks 
and  prominences,  since  other  work  [9]  has  shown  that 
breaks  are  useful  for  predicting  prominence.  Of  course, 
this  would  require  increased  amounts  of  training  data 
and  somewhat  more  complex  algorithms  for  computing 
the  parse  score.  Finally,  these  experiments  represent 
initial  efforts  in  working  with  the  MIT  recognizer  and 
parser,  and  new  acoustic  and  syntactic  features  might 
take  better  advantage  of  the  MIT  system. 

The  parse  scoring  algorithm  is  trained  automatically  and 
is in principle easily extensible to other tasks and other
speech  understanding  systems.  However,  our  effort  to 
evaluate  the  algorithm  in  the  ATIS  domain  raised  some 
issues  associated  with  portability.  New  prosodic  labels 
were  added  to  accommodate  hesitation  and  disfluency 
phenomena  observed  in  spontaneous  speech,  a  problem 
that  we  expect  will  diminish  as  prosodic  labeling  conven¬ 
tions  converge.  Problems  arose  due  to  the  differences  in 
the  definition  of  a  “word”  among  component  modules  in 
the  system,  which  might  be  addressed  by  standardization 
of  lexical  representation  and/or  by  additional  changes  to 
prosodic  labeling  conventions.  Finally,  the  specific  choice 
of  questions  used  in  the  decision  trees  was  determined  in 
part  by  hand  to  accommodate  the  output  “vocabulary” 
of  the  particular  recognizer  and  parser  used.  Though 
this  aspect  could  be  completely  automated  by  creating 
standards  for  parse  trees  and  recognizer  “phone”  labels, 
the use of some hand-tuning of questions allows us to optimize
performance by taking advantage of the features
of different systems and knowledge of the task domain.

Clearly,  performance  in  different  spoken  language  sys¬ 
tems  will  be  affected  by  several  factors,  including  the 
reliability  and  level  of  detail  of  the  parser,  the  accu¬ 
racy  of  the  recognizer,  the  types  of  ambiguities  in  the 
task  domain  and  the  sophistication  of  other  knowledge 
sources  (e.g.  semantic,  discourse)  in  the  system.  We  plan 
to  explore  these  issues  further  by  assessing  performance 
of  the  algorithm  in  the  SRI  ATIS  system.  (Of  course, 
it  may  be  that  the  constrained  semantics  of  the  ATIS 
task  make  it  difficult  to  assess  the  potential  benefits  of 
prosodic  information.)  Implementation  and  evaluation 
of  prosody /parse  scoring  in  the  two  systems  should  have 
implications  for  spoken  language  system  design,  and  our 
initial  work  already  raises  some  issues.  In  particular, 
there  are  cases  where  prosody  could  benefit  speech  un¬ 
derstanding,  but  is  not  useful  unless  the  natural  lan¬ 
guage  component  provides  more  than  one  parse  for  a 
hypothesized  word  string,  e.g.  for  lists  of  numbers  and 
for  utterances  with  possible  disfluencies.  In  addition,  it 
might  be  useful  to  have  explicit  filled  pause  models  used 
in  recognition  (a  capability  available  in  some  versions  of 
the  MIT  system  that  was  not  used  in  this  experiment), 
to  help  distinguish  hesitations  (marked  by  the  “p”  dia¬ 
critic)  from  well-formed  prosodic  boundaries. 

In  conclusion,  we  emphasize  that  these  experiments  rep¬ 
resent  initial  efforts  at  integrating  prosody  in  speech  un¬ 
derstanding  and  there  is  clearly  much  more  work  to  be 
done  in  this  area.  In  addition  to  improving  the  basic 
components  of  the  model  and  evaluating  more  parse  hy¬ 
potheses,  there  are  many  other  possible  architectures 
that  might  be  investigated  for  integrating  prosody  in 
speech  understanding. 